Investigations on linear transformations for speaker adaptation and normalization

Author

  • Michael Pitz
Abstract

This thesis deals with linear transformations at various stages of the automatic speech recognition process. In current state-of-the-art speech recognition systems, linear transformations are widely used to compensate for a potential mismatch between the training and testing data and thus enhance recognition performance. A large number of approaches have been proposed in the literature, but the connections between them have so far been disregarded. By developing a unified mathematical framework, close relationships between the particular approaches are identified and analyzed in detail. Mel-frequency cepstral coefficients (MFCCs) are commonly used features for automatic speech recognition systems. The traditional way of computing MFCCs suffers from a twofold smoothing, which complicates both the MFCC computation and the system optimization. An improved approach is developed that does not use any filter bank and thus avoids the twofold smoothing. This integrated approach allows a very compact implementation and requires fewer parameters to be optimized. Starting from this new computation scheme for MFCCs, it is proven analytically that vocal tract normalization (VTN) is equivalent to a linear transformation in the cepstral space for arbitrary invertible warping functions. As examples, the transformation matrix for VTN is calculated explicitly for three commonly used warping functions. Based on some general characteristics of typical VTN warping functions, a common structure of the transformation matrix is derived that is almost independent of the specific functional form of the warping function. Expressing VTN as a linear transformation makes it possible, for the first time, to take the Jacobian determinant of the transformation into account for any warping function. The effect of considering the Jacobian determinant on the warping factor estimation is studied in detail.
The second part of this thesis deals with a special linear transformation for speaker adaptation, the Maximum Likelihood Linear Regression (MLLR) approach. Based on the close interrelationship between MLLR and VTN proven in the first part, the general structure of the VTN matrix is adopted to restrict the MLLR matrix to a band structure, which significantly improves MLLR adaptation when adaptation data are limited. Finally, several enhancements to MLLR speaker adaptation are discussed. One deals with refined definitions of regression classes, which is of special importance for fast adaptation when only limited adaptation data are available. Another enhancement uses confidence measures to account for recognition errors in the first pass of a two-pass adaptation process, since such errors decrease the adaptation performance.
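
As a rough illustration of the band-structured MLLR matrix described above, the following minimal sketch zeroes all entries of a transformation matrix that lie more than a chosen number of positions off the main diagonal, then applies the usual affine mean update (matrix times mean, plus bias). The function names `band_restrict` and `adapt_mean`, the bandwidth, and all numeric values are assumptions made for this example and are not taken from the thesis. The sketch does not estimate anything: in an actual system the matrix would be estimated from adaptation data under the band constraint, whereas masking a given matrix here only shows the resulting structure.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def band_restrict(matrix: Matrix, bandwidth: int) -> Matrix:
    """Zero every entry more than `bandwidth` positions off the main diagonal."""
    return [
        [value if abs(i - j) <= bandwidth else 0.0 for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


def adapt_mean(matrix: Matrix, bias: Vector, mean: Vector) -> Vector:
    """Apply an affine, MLLR-style mean update: matrix @ mean + bias."""
    return [
        sum(a * m for a, m in zip(row, mean)) + b
        for row, b in zip(matrix, bias)
    ]


# Illustrative values only (hypothetical, not estimated from any data).
full = [
    [1.0, 0.2, 0.5],
    [0.1, 0.9, 0.3],
    [0.4, 0.2, 1.1],
]
banded = band_restrict(full, bandwidth=1)  # keeps the tridiagonal part
adapted = adapt_mean(banded, bias=[0.0, 0.1, -0.1], mean=[1.0, 2.0, 3.0])
```

With a bandwidth of 1 on a 3x3 matrix, only the two corner entries are removed. The point of such a restriction is that a banded matrix has fewer free parameters than a full one, which matters when little adaptation data is available.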

Similar resources

Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation

A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...

Knowledge-Rich Model Transformations for Speaker Normalization in Speech Recognition

In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF...

Investigating Explicit Model Transformations for Speaker Normalization

In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF...

Combining Vocal Tract Length Normalization with Linear Transformations in a Bayesian Framework

Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR-based adaptation techniques, being much closer in quality to that generated by the original average voice model. By contrast, with just a single parameter, VTLN c...

Publication date: 2005